Week 6 of 12 · Part B — Alignment Literacy

The Alignment Problem, Assembled

Halfway. The four pieces of Week 6 become one argument you can defend on a whiteboard.

Day 30 of 60 · ~50 minutes · Review

What you now hold

You crossed into Part B and came out alignment-literate. You can separate outer alignment (is the objective what we want?) from inner alignment (did the model learn that objective or a proxy?); you've watched reward hacking open the outer gap under Goodhart pressure; you've seen goal misgeneralization defeat even a correct reward; and you know why RLHF has a ceiling and why its alignment can be shallow enough to strip in a few tokens. This is the lens for everything in Weeks 7–9.

The through-line of Week 6

The alignment problem is one sentence: the target we can specify isn't the thing we want. We can only specify and optimize proxies, so a capable optimizer pursues the proxy. The failure can be outer (the objective itself is wrong) or inner (the objective is right, but the learned goal is not), and once proxy and target diverge, pushing harder widens the gap rather than closing it. That's why capability does not buy alignment, and why "more RLHF" is not, by itself, a plan.

The one argument, in four moves

The Assembly

1 · Outer vs inner — two places to fail

Outer: the specified objective is a proxy for what we want, and the gap never fully closes. Inner: even a perfect objective can be internalized as a correlated proxy goal. These are different failures with different fixes, so name which one you are facing before you reach for a fix.

2 · Reward hacking — the outer gap under pressure

Goodhart's law: optimize a proxy hard and it stops tracking the goal. The toy from Day 27 is the whole problem on one chart — the proxy-best answer walks off the true peak.
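The shape of that failure can be reproduced in a few lines. The sketch below is not the Day 27 toy itself; it is a stand-in built on stated assumptions: a hypothetical `true_value` that peaks at x = 2, a hypothetical `proxy_value` that agrees with it near zero but never turns down, and a `budget` parameter standing in for optimization pressure.

```python
# Hypothetical Goodhart toy. All functions and numbers are illustrative
# assumptions, not taken from the course material.

def true_value(x: float) -> float:
    # What we actually want: improves up to x = 2, then degrades.
    return x - x * x / 4.0


def proxy_value(x: float) -> float:
    # What we can measure: tracks true_value near 0 but keeps rising.
    return x


def optimize_proxy(budget: int) -> float:
    # "Optimization pressure" = how far the search is allowed to reach.
    candidates = [i * 0.5 for i in range(2 * budget + 1)]  # 0.0, 0.5, ..., budget
    return max(candidates, key=proxy_value)


def goodhart_table(budgets):
    # For each budget, record the proxy-best x and both scores at that x.
    rows = []
    for b in budgets:
        x = optimize_proxy(b)
        rows.append((b, x, proxy_value(x), true_value(x)))
    return rows


if __name__ == "__main__":
    for b, x, p, t in goodhart_table([1, 2, 4, 8]):
        print(f"budget={b}  proxy-best x={x:.1f}  proxy={p:.1f}  true={t:.2f}")
```

Under these assumptions the proxy score climbs 1, 2, 4, 8 as the budget grows, while the true value goes 0.75, 1.0, 0.0, -8.0: the optimizer reaches the true peak at a modest budget and then walks off it.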

3 · Goal misgeneralization — the inner gap, post-training

With a correct reward, many goals fit the training behavior; the model can stay competent and pursue the wrong one out of distribution. In-distribution evals can't see it — you have to engineer the shift that separates goal from proxy.
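A minimal sketch of why the shift has to be engineered, under loud assumptions: the two policies below are hand-written stand-ins for what a model might have learned, not trained agents, and the corridor, `TRAIN_COINS`, and `SHIFTED_COINS` are hypothetical. In training the coin always sits at the right end, so "seek the coin" and "go right" are indistinguishable.

```python
from typing import Callable, Sequence

# A policy maps (agent_pos, coin_pos) to a step of -1 or +1.
Policy = Callable[[int, int], int]


def seek_coin(agent: int, coin: int) -> int:
    # Intended goal: move toward the coin wherever it is.
    return 1 if coin > agent else -1


def go_right(agent: int, coin: int) -> int:
    # Proxy goal: ignores the coin; "right" is merely where it was in training.
    return 1


def reaches_coin(policy: Policy, coin: int, start: int = 5,
                 length: int = 11, max_steps: int = 20) -> bool:
    # Walk a 1-D corridor of cells 0..length-1, clamped at both walls.
    agent = start
    for _ in range(max_steps):
        if agent == coin:
            return True
        agent = min(length - 1, max(0, agent + policy(agent, coin)))
    return agent == coin


def success_rate(policy: Policy, coin_positions: Sequence[int]) -> float:
    hits = sum(reaches_coin(policy, c) for c in coin_positions)
    return hits / len(coin_positions)


TRAIN_COINS = [10, 10, 10, 10]  # coin always at the right end
SHIFTED_COINS = [0, 2, 8, 10]   # engineered shift: coin decoupled from "right"

if __name__ == "__main__":
    for name, policy in [("seek_coin", seek_coin), ("go_right", go_right)]:
        print(name,
              "train:", success_rate(policy, TRAIN_COINS),
              "shifted:", success_rate(policy, SHIFTED_COINS))
```

Both policies score 1.0 on the training positions, so an in-distribution eval cannot tell them apart. On the shifted positions `seek_coin` stays at 1.0 while `go_right` drops to 0.5, and it fails exactly where the coin sits to the left of the start.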

4 · The limits of RLHF — why the obvious fix has a ceiling

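RLHF can fail at every stage: the human feedback is flawed, the reward model is itself a proxy that can be Goodharted, and policy optimization can game whatever the reward model gets wrong. The alignment that results can be shallow enough to strip in a few tokens. The fix that so much relies on is necessary, not sufficient.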

The reflection ritual still runs

The judgment-call discipline from Week 1 didn't stop being relevant when the topic got theoretical. When you label a failure outer vs inner, or claim a system is "reward hacking," that's a call others should be able to challenge. Check the written result (which paper actually showed this?), check precedent (is this the demonstrated claim or the headline?), and when it's genuinely ambiguous, say so precisely. Alignment literacy is worth most when you can state exactly what has and hasn't been shown.

Your work today

Prove the week — a case analysis

Self-quiz, no notes: define outer alignment, inner alignment, reward hacking, goal misgeneralization, and shallow alignment — each in one sentence.
Write a specification-gaming / reward-hacking case analysis: pick one real system and trace the failure from objective → proxy → behavior, naming whether it's an outer or inner failure and what would have detected it.
Re-skim §1–3 of The Alignment Problem from a Deep Learning Perspective and confirm you can place each Week 6 concept inside its frame.
Write your Week 6 summary in your own words, plus the one alignment question you most want Weeks 7–9 to answer.
Honors: make the case analysis trace objective → proxy → behavior end to end, and explain the alignment problem to someone non-technical in three sentences.

The expert move

A practitioner can recite the terms. An expert can assemble them into one argument — outer/inner, reward hacking, goal misgeneralization, the RLHF ceiling — and trace a single real failure all the way from objective to behavior, naming what would have caught it. That synthesis is what separates someone who has read the alignment papers from someone who can reason with them in a room where a deployment decision is being made.

Say this in an interview: "The alignment problem comes down to one thing: we can only optimize proxies, so a capable system pursues the proxy, not the goal. That shows up as an outer failure when the objective is wrong, or as an inner failure when the model learned a correlated goal instead. Reward hacking and goal misgeneralization are the two faces of that, and RLHF's shallow, gameable alignment is why more of it isn't, by itself, the answer."

Week 6 Takeaways

The alignment problem in one line: the target we can specify isn't the thing we want. We optimize proxies, so capable systems chase proxies.
Outer vs inner, reward hacking, and goal misgeneralization are the core vocabulary — and they assemble into one argument.
RLHF is necessary, not sufficient: flawed feedback, a Goodhart-able reward model, and shallow alignment you can strip in a few tokens.
Next: if alignment can fail invisibly, can a model fake it during evaluation? Week 7 reads the actual evidence.